We describe collaboration networks consisting of research projects funded by the European Union
and the organizations involved in those projects. The networks are of substantial size and complexity,
but are important to understand due to the significant impact they could have on research policies
and national economies in the EU. In empirical determinations of the network properties, we observe 
characteristics similar to other collaboration networks, including scale-free degree distributions, 
small diameter, and high clustering. We present some plausible models for the formation and 
structure of networks with the observed properties. 
I. INTRODUCTION
Real-world network analysis has become a major research topic in recent years. Most prominent are perhaps the
investigations of the structure of the World Wide Web, the network of internet routers, and certain social networks
like citation networks. On the theoretical side, one tries to understand the mechanisms of formation of such networks
and to derive statistical properties of the networks from the generating rules. On the rigorous mathematical side,
there are only a few results for specific models, indicating the difficulty of a purely mathematical approach (for a
survey of recent results in this direction, see [?]). Thus, the main approach is to use some mean field assumption to
get relevant information about the corresponding graphs. Although it is not clear where the limits of this approach
lie, in many cases the results match well with numerical simulations and empirical data.

In this article, we study a particular collaboration network. Its vertices are research projects funded by the
European Union and the organizations involved in those projects. In total, the database contains over 20000 projects
and 35000 participating organizations. The network shows all the main characteristics known from other complex
network structures, such as scale-free degree distribution, small diameter, high clustering, and inhomogeneous vertex
correlations.

Besides the general interest in studying a new, real-world network of large size and high complexity, the study
could have a significant economic impact. Improving collaboration between actors involved in innovation processes
is a key objective of current science, technology, and innovation policy in industrialized countries. However, very 
little is known about what kind of network structures emerge from such initiatives. Moreover, it is quite likely that 
network structure affects network functions such as knowledge creation, knowledge diffusion, and the collaboration of 
particular types of actors. Presumably, this is determined by both endogenous formation mechanisms and exogenous 
framework conditions. In order to progress in our understanding, it is therefore essential to have sound statistics on 
the structure of networks we observe and to develop plausible models of how these are formed and evolve over time. 

The model networks we use to compare with the empirical data are random intersection graphs, a natural framework 
for describing projections of bipartite graphs. Discrete intersection graphs similar to the ones we use were first discussed
in [?]. We extend and refine the construction from [?] to be more applicable to real-world graphs.

Perhaps the most important finding from our model approach is the strong determination of the real network 
structure by the degree distribution. That is, most statistical properties we measure in the EU research project 
networks are the ones observed in a typical realization of a uniform weighted random graph model with given (bipartite) 
degree distribution as in the EU networks. Since this distribution is characterized by two exponents — one for each 
partition — we have essentially only four parameters (size, edge number, and exponents) which are needed to describe 
the entire network. This is a tremendous reduction of complexity indicating that only a few basic formation rules are 
driving the network evolution. 

In section II we describe the preparation of the data on the EU research programs. We present the empirical
determination of the network properties in section III, followed by an explanation of these properties using a random
intersection graph model in section IV. Finally, in section V we summarize the key results and consider implications
of the network properties for EU research programs.
II. THE DATA SET

In this work, we study research collaboration networks that have emerged in the European Union's first four 
successive four-year Framework Programs (FPs) on Research and Technological Development. Since their inception 
in 1984, six FPs have been launched, on four of which we have comprehensive data. FPs are organized in priority 
areas, which include information and communication technologies (ICTs), energy, industrial technologies, life sciences, 
environment, transportation, and a number of additional activities. In line with economic structural change, the main 
thematic focus of the FPs has shifted somewhat over time from energy and industrial technologies to the application 
of ICTs and life sciences. The majority of funding activities are aimed at stimulating research partnerships between 
firms, universities, research organizations, governmental actors, NGOs, lobby groups, etc. Since FP4, the scope of
activities has been expanded to also cover training, networking, demonstration, and preparatory activities (for details,
see reference [?]). In order to keep our data set compatible over the different FPs, we have excluded the latter set of
projects from FP4 and focus only on collaborative research projects (see table I).

In order to receive funding, projects in FP1 to FP4 had to comprise at least two organizations from at least two 
member states. We have retrieved data on these projects from the publicly available CORDIS (Community Research 
and Development Information Service) projects database [10]. This database contains information on all funded
projects as well as a reasonably complete listing of all participating organizations. 

The raw data on participating organizations is rather inconsistent. Apart from incoherent spelling in up to four 
languages per country, organizations are labelled inhomogeneously. Entries may range from large corporate groupings, 
such as Siemens, or large public research organizations like the Spanish CSIC to individual departments or labs and 
are listed as valid at the time the respective project was carried out. Among heterogeneous organizations, only a 
subset contains information on the unit actually participating or on geographical location (address, city, region and/or 
country). Information on older entries and the substructure of firms tends to be less complete. 

Because of these difficulties, any automatic standardization method akin to the one utilized by Newman [?] is
inappropriate for this kind of data. Rather, the raw data has to be cleaned and completed manually, which is an
ongoing project at ARC systems research. The objective of this work is to produce a data set useful for policy advice by 
identifying homogeneous, economically meaningful organizational entities. To this end, organizational boundaries are 
defined by legal control and entries are assigned to the respective organizations. Resulting heterogeneous organizations, 
such as universities, large research centres, or conglomerate firms are broken down into subentities that operate in 
fairly coherent areas of activity, such as faculties, institutes, divisions or subsidiaries. These can be identified for 
a large number of entries, based on the available contact information of participants, and are comparable across 
organizations. 

The case of the French Centre National de la Recherche Scientifique (CNRS), the most active participant in the
EU FPs, may serve as an illustration. First, 785 separate entries were summarized under a unique organizational
label. Next, these 785 entries were broken down into the eight areas of research activity in which CNRS is currently 
organized. Based on available information on participating units and geographical location, 732 of the 785 entries 
could be assigned to one of these subentities. For the remaining 53 entries, the nonspecific label CNRS was used. 

Comparable success rates were achieved for other large public research organizations and universities. Due to 
scarcer information, firms could not be broken down at a comparable rate. Moreover, due to resource constraints, 
standardization work has focused on the major players in the FPs. Organizations participating in fewer than a total 
of 30 projects in FP1-4 have not been broken down yet. Due to these limitations in processing the data, we cannot 
rule out the possibility of a bias in analysing our data. However, we have run all the reported analyses with the 
undivided organizations and have obtained qualitatively similar results, apart from different extreme values, e.g., 
maximum degree. 

Table I displays information on the present data set, which contains information on a total of 27,758 projects, carried
out over the period 1984 to 2004. It shows that the total budget as well as number of funded projects has increased 
dramatically from FP1 to FP4. Moreover, it provides a rough measure on the completeness of the available data. For 
a sizeable number of projects, the CORDIS project database lists information only on the project co-ordinator. This 
is due to the age of the data and inhomogeneous disclosure policies of different units at the European Commission. 
Comparing the number of projects containing information on more than one participant with the total number of 
projects funded in each FP shows that the data is fairly complete as of FP2. 

The fact that FP1 was the first program launched and that the available data are rather incomplete make it 
exceptional in many respects. We therefore focus our analyses on FP2-4 and only give graph characteristic values for 
FP1 to indicate the difference to the networks created by the subsequent FPs. 
III. THE NETWORK STRUCTURE

In this section, we present the basic properties of the network structure for projects and organizations in the first 
four EU Framework Programs. We consider both graphs as intersection graphs, each being the dual of the other, 
which, for our purposes, is generally more convenient than the usual bipartite-graph point of view. Recall that an 
intersection graph is given by an enumerated collection of sets — the vertices of the intersection graph — with elements 
from a given fixed base-set and edges defined via the intersection property (edge = nonempty intersection of two sets). 
The sets need not be distinct. 
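As an illustration of this definition (not part of the original analysis), the following minimal Python sketch builds the intersection graph of a family of sets and its dual. The function names and the toy project data are hypothetical.

```python
from itertools import combinations

def intersection_graph(sets):
    """Return the edge set of the intersection graph of a list of sets.

    Vertices are the list indices; i ~ j iff sets[i] and sets[j] share an element.
    """
    edges = set()
    for i, j in combinations(range(len(sets)), 2):
        if sets[i] & sets[j]:
            edges.add((i, j))
    return edges

def dual_family(sets):
    """Swap roles: each base element becomes the set of indices that contain it."""
    dual = {}
    for idx, members in enumerate(sets):
        for element in members:
            dual.setdefault(element, set()).add(idx)
    return dual

# Hypothetical toy data: three projects given as sets of organization labels.
projects = [{"A", "B"}, {"B", "C"}, {"D"}]
p_edges = intersection_graph(projects)   # P-graph on project indices

# Organizations as sets of projects, then the O-graph on sorted organization names.
orgs = dual_family(projects)
org_names = sorted(orgs)
o_edges = intersection_graph([orgs[name] for name in org_names])
```

In this toy example the two projects sharing organization B are linked in the P-graph, while organizations A, B, C form a path in the O-graph.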

We denote by $\mathcal{P} = \{P_1, \ldots, P_m\}$ the family of projects and by $\mathcal{O} = \{O_1, \ldots, O_n\}$ the family of organizations. Projects
are understood as labeled sets of organizations and organizations as labeled sets of projects. The corresponding
intersection graphs are denoted by $G_P$ and $G_O$ and we will sometimes use the terms P-graph and O-graph for them.
The size of a vertex $x$ from $G_P$ or $G_O$ is the cardinality of the set corresponding to the vertex; in the picture
of bipartite graphs, the size is just the degree of the vertex. In tables II and III we give some basic parameters
measured on the P- and O-graphs from the four Framework Programs. Since the degree distribution for P-graphs 
is a superposition of two power-law distributions (one for small degree values and one for large values), we give the 
corresponding values for the exponents parenthetically. 

As expected, FP1-4 are of small world type: high clustering coefficient and small diameter of the giant component. 
There is a slight increase in the clustering coefficient of the O-graphs from FP1 to FP4, indicating a stronger integration 
amongst groups of collaborating organizations. This is also reflected in the mean project size which increases from 
2.4 to 6.2. There is an interesting jump in the P-graph mean degree values and the mean triangle numbers between
FP1 and FP2 and between FP2 and FP3. The maximal degrees of the O-graphs are very high in comparison with the mean
degree, which is a consequence of the power law degree structure. For the P-graphs, the gap between mean and 
maximal degree is less pronounced. 

More information is contained in the statistical properties of the relevant distributions. The numerical data strongly 
indicate that the size distributions follow power laws. Also, the O-graph degree distribution is of power-law type, while 
the project-graph degree distribution is a superposition of two scale free distributions, one dominating the distribution 
for small degree values (up to 100) and one relevant for the large degree values. We discuss these properties at greater 
length in the following sections. 

A. Size distributions 



The size distributions are the basic distributions for the EU networks since, as will be shown in section IV B, a
typical sample from the random graph space with fixed size distributions like in FP2-4 will have very similar statistical
properties to FP2-4. This strongly suggests that there is essentially no additional correlation in the data once the size
distribution is known. Both the O-graph and P-graph size distributions show clear asymptotic power law distributions
for FP1-4 (figs. 1 and 2). In terms of the corresponding bipartite graph, these are just the degree distributions of
the project and organization partitions. While the O-graph size distribution is of power law type over the whole size
range, the P-graph size distribution deviates strongly from the power law for small size values. In section IV we give
a possible explanation for the appearance of the power law distribution for size.

The numerical values for the exponents of the organization size distributions from FP2-4 are slightly below 2, 
but constant within the error tolerance. This indicates that the distribution of organizations able to carry out a 
particular number of projects has not changed in the three Framework Programs. A complementary interpretation 
of this finding is that the underlying research activities, which we know to have changed over time, have not altered 
the mix of organizations participating in a particular number of projects in each Framework Program. It is further 
worth noting that the values of the O-graph exponents are close to the critical value 2, hence the size expectation 
could diverge for large graphs (whether the value is really below 2 or not is still unclear due to the error tolerance).

The picture is similar for the P-graphs, although there are some differences in the initial behavior (that is, for small 
project sizes) and in the exponent value. The local minimum at size 2 decreases from FP2 to FP4. This points to the
existence of an optimal project size within the regime of the EU FPs. Moreover, the rise in the average project size 
indicates that increases in the available funding from FP2 to FP4 not only lead to more projects, but also slightly 
larger projects. This is consistent with recommendations from evaluation studies and the stated attempts of the 
EU commission to reduce its administrative burden. As a whole, the size distribution for the P-graphs matches in 
the asymptotic regime very well to a power law with exponent around -3, hence indicating that the mechanisms for 
coagulation of organizations into a project did not greatly change from FP2-4. 
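The text does not state how the exponents were estimated. Purely as an illustration, and not as the method used for the reported values, a rough exponent can be read off as the least-squares slope of the size histogram in a log-log plot. The function name and the synthetic data below are assumptions.

```python
import math

def loglog_slope(values, counts):
    """Least-squares slope of log(count) against log(value)."""
    xs = [math.log(v) for v in values]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic check: counts exactly proportional to k**-3 give a slope of -3.
sizes = [2, 4, 8, 16, 32]
counts = [1e6 * k ** -3 for k in sizes]
slope = loglog_slope(sizes, counts)
```

On real, noisy histograms a plain log-log fit is known to be sensitive to binning and to the choice of the fitted range, so the result should be treated as a rough guide only.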
B. The degree distribution

Since the degree distribution in the projection graphs is just the distribution of the size of the 2-neighborhood
$N_2(x) := \#\{y : d_{\mathrm{bip}}(x, y) = 2\}$, where $d_{\mathrm{bip}}$ denotes the distance in the bipartite graph, it is not surprising that this quantity is closely connected to the size distribution. In
the absence of other special correlations, it can be shown (see section IV) that the degree distribution is determined
by the size distribution in a rather simple way. Namely, for the case when both size distributions are scale-free
with exponents, say, $\alpha$ (O-size) and $\beta$ (P-size), the P-graph degree distribution is a superposition of two power-law
distributions with exponents $\alpha - 1$ (and cutoff given by the maximal O-size value) and $\beta$. The same holds vice versa
for the O-graph.
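The identification of the projection degree with the size of the 2-neighborhood can be sketched as follows. This is a minimal illustration; the function name and the toy membership data are hypothetical.

```python
def two_neighborhood_size(x, membership, dual):
    """Number of vertices at bipartite distance exactly 2 from x.

    membership[x] is the set of bipartite neighbors of x (for a project, its
    organizations); dual[o] is the set of vertices on x's side adjacent to o.
    """
    reached = set()
    for o in membership[x]:
        reached |= dual[o]
    reached.discard(x)  # x itself is at distance 0, not 2
    return len(reached)

# Hypothetical toy data: projects as sets of organizations, and the dual view.
membership = {"P1": {"A", "B"}, "P2": {"B"}, "P3": {"A", "B"}, "P4": {"C"}}
dual = {"A": {"P1", "P3"}, "B": {"P1", "P2", "P3"}, "C": {"P4"}}
degrees = {p: two_neighborhood_size(p, membership, dual) for p in membership}
```

Note that P1 and P3 share two organizations but contribute only one unit to each other's degree; the surplus is what the edge multiplicity of section III C records.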

In figs. 3 and 4 we show the degree distribution for the P- and O-graphs in a log-log plot. While the organization
graphs for FP2-4 show a clear power law, the picture for the project graphs is more complicated. As previously 
mentioned, the P-graph degree distribution shows two different power laws, one for the initial segment up to degree 
150 and another one for large degrees. Nevertheless, there is still a widely scattered heavy tail in the degree distribution. 
The deviation from a power law in the P-graphs indicates a kind of anticorrelation: large projects above a size of 
15 are mainly formed by organizations of small size. A possible explanation is that large projects have a time- and 
resource-demanding intrinsic network structure, making it more unlikely that a participating organization has other 
projects (of course, with the exception of hub-like organizations such as CNRS with a priori unlimited capacity).

C. Clustering, correlation and edge multiplicity 

By their construction process, intersection graphs have a naturally high clustering coefficient. This is easily seen,
since an organization which participates in, say, $k$ projects generates a complete subgraph of order $k$ in the
P-graph amongst these projects. If the probability for an organization to be in more than one project is
asymptotically bounded away from zero, it follows that the P-graph (and similarly the O-graph, through an
analogous argument) has a nonvanishing clustering coefficient. In the present study, we focus on the triangle number
$\Delta(x) := \#\{\text{triangles containing } x\}$, $x \in \mathcal{P}$ or $\mathcal{O}$, as a measure of local clustering. We define the degree-conditional
mean triangle number as $\Delta_k := E\{\Delta(x) \mid d(x) = k\}$. As seen in figs. 5 and 6, we have $\Delta_k \sim k$ for both graph types.
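A minimal sketch of how the triangle number and its degree-conditional mean could be computed from an adjacency structure is given below. The function names and the toy graph are hypothetical, and the brute-force pair enumeration is only meant for small examples.

```python
from collections import defaultdict
from itertools import combinations

def triangle_numbers(adj):
    """Number of triangles through each vertex of an undirected graph."""
    tri = {}
    for x, nbrs in adj.items():
        # A triangle through x is a pair of neighbors of x that are adjacent.
        tri[x] = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in adj[u])
    return tri

def mean_triangles_by_degree(adj):
    """Average triangle number over all vertices of each degree k."""
    tri = triangle_numbers(adj)
    buckets = defaultdict(list)
    for x, nbrs in adj.items():
        buckets[len(nbrs)].append(tri[x])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Toy graph: triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
tri = triangle_numbers(adj)
by_degree = mean_triangles_by_degree(adj)
```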

There is a good explanation for this type of behavior in the framework of intersection graphs (see section IV).
As noted above, high clustering in intersection graphs is not necessarily an indication of local correlations between 
vertices. This is already seen in the case of an Erdős-Rényi random bipartite graph where an edge between any
project and organization is drawn i.i.d. with probability $p$. If $\mathcal{P}$ and $\mathcal{O}$ are of equal cardinality $N$ and $p = c/N$, the
expected bipartite degree equals $c$. For large $N$ a typical realization of the random graph looks locally like a tree
with branching number $c - 1$. However, for the projection graphs, we obtain a positive clustering coefficient that
is independent of $N$, since most projects and organizations cause complete graphs of order $c$ and a typical vertex is
therefore a member of $\sim c$ cliques of order $c$.
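The point that a tree-like bipartite graph still projects onto a clustered graph can be checked on a small deterministic example. The sketch below is illustrative only; the helper names and the toy bipartite tree are assumptions.

```python
from itertools import combinations

def project(memberships):
    """Adjacency of the intersection graph of a dict name -> set of elements."""
    adj = {name: set() for name in memberships}
    for a, b in combinations(memberships, 2):
        if memberships[a] & memberships[b]:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def local_clustering(adj, x):
    """Fraction of neighbor pairs of x that are themselves adjacent."""
    pairs = list(combinations(sorted(adj[x]), 2))
    if not pairs:
        return 0.0
    return sum(1 for u, v in pairs if v in adj[u]) / len(pairs)

# Hypothetical bipartite tree: O1 joins P1, P2, P3 and O2 joins P3, P4, P5.
# The bipartite graph has no cycles, yet the P-graph consists of two triangles.
memberships = {"P1": {"O1"}, "P2": {"O1"}, "P3": {"O1", "O2"},
               "P4": {"O2"}, "P5": {"O2"}}
p_adj = project(memberships)
c_p1 = local_clustering(p_adj, "P1")
c_p3 = local_clustering(p_adj, "P3")
```

Here P1 has clustering 1 and P3, which sits in both cliques, has clustering 1/3, although the underlying bipartite graph contains no cycle at all.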

A better indication for the presence of correlations is given by the so-called multiplicity of edges. For a link between 
two organizations or projects it is sufficient to have just one project or organization, respectively, in common, but of 
course there could be more. Given an edge $x \sim y$, we define $m(x, y) := |x \cap y| - 1$ and call it the multiplicity of the
edge. As will be discussed in the next section, random intersection graphs without local search rules can nevertheless
admit a high edge multiplicity. In figs. 7 and 8 the multiplicity distribution is shown for P- and O-graphs of FP2-4.
There is an almost perfect power-law behavior with exponent 4.3. Note that positive multiplicity in the projection 
graphs translates in the bipartite graph picture into the presence of cycles of length four. The presence of exceptionally 
high multiplicity in the P-graphs may be caused by memory effects due to prior collaborative experience. Also, a 
greater edge multiplicity may result from the fact that organizations are active in a wider set of complementary 
activities. In this case, intra-organizational spillovers may also be of importance as search for potential partners may 
be influenced by the collaboration behavior of other actors within an organization. Such effects should be detectable 
from a fine structure analysis of the time evolution of the corresponding graphs. 
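For concreteness, the edge multiplicity defined above can be computed as in the following sketch. The function name and the toy organization data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def edge_multiplicities(memberships):
    """Map each edge (x, y) of the intersection graph to m(x, y) = |x & y| - 1."""
    mult = {}
    for x, y in combinations(sorted(memberships), 2):
        shared = len(memberships[x] & memberships[y])
        if shared > 0:  # only pairs with a common element form an edge
            mult[(x, y)] = shared - 1
    return mult

# Hypothetical toy data: organizations as sets of project identifiers.
orgs = {"A": {1, 2, 3}, "B": {2, 3}, "C": {3}, "D": {4}}
mult = edge_multiplicities(orgs)
multiplicity_histogram = Counter(mult.values())
```

A and B collaborate in two projects, so their edge has multiplicity 1, which corresponds to a cycle of length four in the bipartite picture.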

D. Diameter and mean path length 

There is essentially no difference in the diameter value of the largest component in the four Framework Program 
networks. A classical random graph of the same size and the same edge number would have a diameter about 
$\log N$. The mean path length is about a third of the diameter and shows a slightly higher variation between
the different framework programs. It is well known that the expected path length in random graphs with a scale free 
degree distribution and exponent less than 3 is essentially independent of the graph size (the diameter of the largest 
component still increases in N but only as log log N) . The same holds for random intersection graphs with power law 
size and degree distributions. Since the O-graphs seem to fall into that class, the almost constant diameter and
path length is not surprising. Although the P-graphs do not show an asymptotic power law structure for the degree, 
there is a strong increase in the edge density from FP2 to FP4, keeping the diameter of the largest component almost 
fixed. 
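Diameter and mean path length of a component can be obtained by breadth-first search from every vertex. The sketch below is a generic illustration under that assumption, not a description of how the reported values were computed; the function names and the toy path graph are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_mean_path(adj, component):
    """Diameter and mean shortest-path length over all pairs in one component."""
    longest, total, pairs = 0, 0, 0
    for s in component:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                longest = max(longest, d)
                total += d
                pairs += 1
    if pairs == 0:  # a single-vertex component has no pairs
        return 0, 0.0
    return longest, total / pairs

# Toy example: the path graph 0-1-2-3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
diam, mean_len = diameter_and_mean_path(adj, [0, 1, 2, 3])
```

The all-pairs search costs one traversal per vertex, so for graphs of the size considered here one might prefer to sample source vertices.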



IV. A RANDOM INTERSECTION GRAPH MODEL 



Intersection graphs are a natural framework for networks derived from a membership relation, such as citation 
networks, actors networks, or networks reflecting any other kind of cooperation. As previously mentioned, intersection 
graphs by construction have a high clustering coefficient. As explained below, the clique distribution of a random 
intersection graph is almost given by the size distribution of the dual graph. 



A. Random intersection graphs with given size distribution 



One of the simplest random intersection models is constructed in the following way. Knowing the size of a set to be
constructed, we generate a random subset from a finite base set $X = \{a_1, a_2, \ldots, a_N\}$ of $N$ elements, such that each
set element is drawn i.i.d. uniformly from $X$. These subsets constitute the vertices of a random graph. Edges are
defined via the set intersection property, namely we have an edge between $i$ and $j$ (denoted by $i \sim j$) if and only if the
associated subsets $A_i$ and $A_j$ have nonempty intersection (to compare with earlier sections, $A$ stands here for either
project sets $P$ or organization sets $O$). The size (cardinality) of the subsets is either itself a random variable drawn
i.i.d. from a probability distribution $\varphi(k)$ or given by a list $\{D_k := \#\{A_i : |A_i| = k\}\}$ (where for each $i$ a conditional
random choice is made to which size class it belongs). For the latter case, we define again $\varphi(k) := D_k / M$, where $M$ is the
total number of sets to be formed.
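A minimal sketch of this construction for a prescribed list of set sizes is shown below. It assumes that the elements of one set are distinct (sampling without replacement), so that every set has exactly its prescribed size; the function name, the parameters, and the seed handling are hypothetical.

```python
import random
from itertools import combinations

def random_intersection_graph(sizes, n_base, seed=None):
    """Draw one uniform random subset per entry of `sizes` and link overlapping sets.

    sizes   -- prescribed cardinalities |A_i| of the sets (the vertices)
    n_base  -- number N of elements in the base set {0, ..., N-1}
    """
    rng = random.Random(seed)
    # Assumption: elements within a set are distinct, so len(A_i) == sizes[i].
    sets = [set(rng.sample(range(n_base), k)) for k in sizes]
    edges = {(i, j) for i, j in combinations(range(len(sets)), 2)
             if sets[i] & sets[j]}
    return sets, edges

sets, edges = random_intersection_graph([2, 3, 1, 4], n_base=50, seed=1)
```

The size list can come from a measured size sequence or from draws of a chosen size distribution; the sketch leaves that choice to the caller.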

Since we want to compare the model with the EU cooperation network, we are mainly interested in the situation
when $\varphi$ is an asymptotic power law distribution,

$$\varphi(k) = \frac{C_2}{k^{\alpha + o(1)}}, \qquad \alpha > 2. \tag{1}$$

This assumption is also reasonable for many other applications where vertices are formed from a base set of elements.
To obtain an interesting limiting random graph space, we further assume that the number of chosen subsets is $C_1 \cdot N$,
where $C_1$ is neither too large nor too small (for FP2-4 we have about twice as many organizations as projects,
hence $C_1$ is either 2 or 0.5).

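A basic quantity for the analysis of intersection graphs is the conditional edge probability given the size of two
subsets:

\begin{align}
P_{kl}(N) &:= \Pr\{i \sim j \mid |A_i| = k \text{ and } |A_j| = l\} \tag{2} \\
&= \Pr\{A_i \cap A_j \neq \emptyset \mid |A_i| = k \text{ and } |A_j| = l\} \tag{3} \\
&= 1 - \binom{N-k}{l} \Big/ \binom{N}{l} \tag{4} \\
&= 1 - \frac{(N-k)!\,(N-l)!}{N!\,(N-k-l)!} \tag{5} \\
&= 1 - \frac{(N-k)(N-k-1) \cdots (N-k-l+1)}{N(N-1)(N-2) \cdots (N-l+1)}. \tag{6}
\end{align}

Using the condition $lk \ll N$, we obtain

\begin{align}
P_{kl}(N) &= 1 - \left(1 - \frac{k}{N}\right)\left(1 - \frac{k}{N-1}\right) \cdots \left(1 - \frac{k}{N-l+1}\right) \tag{7} \\
&= 1 - \left(1 - \frac{kl}{N} + o\!\left(\frac{1}{N}\right)\right) \tag{8} \\
&= \frac{kl}{N} + o\!\left(\frac{1}{N}\right). \tag{9}
\end{align}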
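The quality of the approximation of the conditional edge probability by $kl/N$ can be checked numerically. The following sketch evaluates the exact binomial-coefficient expression and the leading-order term; the function names are hypothetical.

```python
from math import comb

def edge_probability(k, l, n):
    """Exact probability that a uniform k-subset and l-subset of an n-set intersect."""
    return 1.0 - comb(n - k, l) / comb(n, l)

def edge_probability_approx(k, l, n):
    """Leading-order approximation k*l/n, valid for k*l much smaller than n."""
    return k * l / n

exact = edge_probability(2, 3, 1000)
approx = edge_probability_approx(2, 3, 1000)
```

For $k = 2$, $l = 3$, $N = 1000$ the two values differ by a few parts in a million; the exact expression is also symmetric in $k$ and $l$, as it must be.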
With this result, we can easily calculate the conditional degree distribution for a vertex of given size. First, we estimate
the conditional subdegree distribution with respect to a given group of vertices of size $m$. Here, the subdegree $d_m(i)$
of a vertex $i$ is defined as the number of edges $i$ has with vertices of size $m$. Clearly $d(i) = \sum_m d_m(i)$. We have

\begin{align}
\psi_l(k, m) &:= \Pr\{d_m(i) = k \mid |A_i| = l\} \tag{10} \\
&= \binom{\#\{j : |A_j| = m\}}{k} \left(\frac{ml}{N} + o\!\left(\frac{1}{N}\right)\right)^{k} \left(1 - \frac{ml}{N} + o\!\left(\frac{1}{N}\right)\right)^{\#\{j : |A_j| = m\} - k}. \tag{11}
\end{align}

The probability that a randomly chosen vertex $j$ has size $m$ equals, by assumption, $C_2 / m^{\alpha + o(1)}$ with normalization
constant $C_2$ (that is, $1 = \sum_m C_2 / m^{\alpha + o(1)}$). We therefore obtain

$$\psi_l(k, m) = \binom{C_1 C_2 N / m^{\alpha + o(1)}}{k} \left(\frac{ml}{N} + o\!\left(\frac{1}{N}\right)\right)^{k} \left(1 - \frac{ml}{N} + o\!\left(\frac{1}{N}\right)\right)^{C_1 C_2 N / m^{\alpha + o(1)} - k}, \tag{12}$$

which converges to a Poisson distribution

$$\psi_l(k, m) \to \frac{c(m)^k}{k!} e^{-c(m)} \tag{13}$$

with $c(m) = m^{1-\alpha}\, l\, C_1 C_2$. Since the distribution $\psi_l(k)$ of the degree of vertices $i$ with $|A_i| = l$ is the convolution of
the Poisson distributions $\psi_l(k, m)$, we obtain again a Poisson distribution for $\psi_l(k)$:

$$\psi_l(k) = \frac{c_l^k}{k!} e^{-c_l} \tag{14}$$

with $c_l = \sum_m c(m) = l \cdot C_3$, where $C_3 = \sum_m m^{1-\alpha} C_1 C_2$ is a well defined constant since $\alpha > 2$. It remains to estimate the
total degree distribution $\psi(k)$. In [2], conditions were given describing when a superposition of Poisson distributions
results in a scale-free distribution. Specifically, we get the following asymptotic estimate:

\begin{align}
\psi(k) &= \sum_m \Pr\{|A_i| = m\}\, \psi_m(k) \tag{15} \\
&\sim \sum_m \frac{C_2}{m^{\alpha + o(1)}} \frac{(m C_3)^k}{k!} e^{-m C_3}. \tag{16}
\end{align}



The main contribution to $\psi(k)$ comes from a rather small interval of $m$-values, called $I_{\mathrm{ess}}(k)$. This interval has the
property that for $m \in I_{\mathrm{ess}}(k)$, the expectation $E(d(i) \mid |A_i| = m)$ is of order $k$. The exponential decay of the Poisson
distribution guarantees that the remaining parts of the sum become arbitrarily small for large $k$. It is important that
the constant $c_l$ has a linear $l$-dependence, since an $l$-proportionality with exponent larger than one would force the
degree distribution to have gaps due to a lack of overlap of the individual Poisson distributions. We therefore obtain
for the degree distribution a power law with the same exponent $\alpha$ as in the size distribution.
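The claim that a power-law mixture of Poisson distributions with linearly growing means is again a power law with the same exponent can be checked numerically. The sketch below sums the unnormalized mixture, ignoring the $o(1)$ corrections, and reads off a local exponent from the ratio of the values at $2k$ and $k$. The function names, the cutoff `m_max`, and the parameter values are assumptions chosen for illustration.

```python
import math

def mixed_poisson_degree(k, alpha, c3, m_max=5000):
    """Unnormalized sum over m of m**-alpha * Poisson(k; mean m * c3)."""
    total = 0.0
    for m in range(1, m_max + 1):
        mean = m * c3
        # Work in log space so that mean**k / k! does not overflow.
        log_term = (-alpha * math.log(m) + k * math.log(mean)
                    - mean - math.lgamma(k + 1))
        total += math.exp(log_term)
    return total

def local_exponent(k, alpha, c3):
    """Estimate the tail exponent from the ratio psi(2k) / psi(k)."""
    ratio = mixed_poisson_degree(2 * k, alpha, c3) / mixed_poisson_degree(k, alpha, c3)
    return -math.log(ratio) / math.log(2)

estimate = local_exponent(50, alpha=3.0, c3=1.0)
```

With a size exponent of 3 the local exponent of the mixture near $k = 50$ comes out close to 3, in line with the argument above; the small deviation shrinks as $k$ grows.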

Although the intersection model gives a power-law degree distribution when the size distribution is already of
power-law type, we will not obtain a power-law distribution for the size on the dual graph unless additional assumptions
are made on the set formation rules. It is easy to see that the size distribution on the dual graph is asymptotically
Poisson. Since $\Pr\{|a| = k\} \sim \binom{M E(|A|)}{k} \left(\frac{1}{N}\right)^{k} \left(1 - \frac{1}{N}\right)^{M E(|A|) - k}$ and $E(|A|)$ converges as well as $M/N$ for $M, N \to \infty$,
we obtain in the limit a Poisson distribution. Nevertheless, the degree distribution on the dual graph still admits a
scale-free part induced by the scale-free size distribution of the intersection graph. We will not discuss many of the
details, but instead provide a simple estimation for the lower bound on the number of elements $a_i$ with $d(a_i) = k$.
Namely, the number of elements $a_i$ which are members of sets $A_j$ with $|A_j| = k$ is, for large $k$ and $M, N \gg k$, about
$k \cdot M \cdot \mathrm{const} / k^{\alpha} = N \cdot \mathrm{const} / k^{\alpha - 1}$. Since $d(a_i) \gtrsim k$ for $a_i \in A_j$ with $|A_j| = k$, we obtain $\mathrm{const} / k^{\alpha - 1}$ as a lower bound on the density of
elements $a_i$ with degree greater than or equal to $k$ (note that we assumed $\alpha > 2$). This estimate holds of course only
up to the maximal size value $k$, which is in the range of the power law distribution for the set sizes $|A_i|$. For larger
$k$-values there is a rapid exponential decay.

The last argument clarifies also the situation when one wants to impose conditions on the size distribution and 
the dual size distribution. Without going into the details of the rather involved analysis, we simply state that the 
resulting degree distribution is given by a superposition of the size distribution and the dual size distribution (the last
one enters with an exponent reduced by one). This explains essentially the picture for the degree distribution for the
P-graph. 

Finally we want to discuss the dependence of the mean triangle number (conditioned on the degree) on the degree, which shows a clear
linear behavior in the empirical data. We argue that this is again a consequence of the power law distribution for the
size. First observe that a size $k$ element $a_i \in A_j$ induces a complete subgraph of order $k - 1$ on the neighborhood vertices of
$A_j$. Furthermore, each maximal $k$-clique in which $A_j$ is a member generates $(k - 1)(k - 2)/2$ triangles for $A_j$. Since
the size distribution of the elements $a_i$ is Poisson with expectation of, say, $c$ and the degree of $A_j$ is proportional to
the size $|A_j|$, we obtain for the conditional expected number of triangles given the degree $k$:

$$\Delta_k := E(\#\text{triangles containing } A \mid d(A) = k) \sim \frac{c^2}{2}\, \mathrm{const} \cdot k. \tag{17}$$

In deriving eq. (17), we used the facts that with high probability the size of the intersection between two sets $A_i$ and
$A_j$ has cardinality 1 (conditioned on the two sets having a nonempty intersection) and that the Poisson distribution
has an exponentially decaying tail. 



B. A Molloy-Reed version of random intersection graphs and a Bernoulli type model 

We sketch the construction of random intersection graphs with given size distribution $\varphi$ and size distribution $\psi$ on
the dual. The two distributions are not independent but have to fulfill the condition $\sum_i i\,[\varphi(i) - \psi(i)] = 0$. There are
further restrictions on the maximal size in order to get a reasonable random graph model. Note that the problem is 
equivalent to the construction of a random bipartite graph given the degree sequence on the two partitions. 

Assign first to each set $A$ and each element $a$ from the base set a random size value according to the given
distributions $\varphi$ and $\psi$. Let $X_k$ be the resulting set of elements with size $k$. Replace each element $a_i$ from $X_k$ by
$k$ virtual elements $a_{i,l}$, $l = 1, 2, \ldots, k$, and form a new base set $X'$ with all the virtual elements. The set formation
process for the sets $\{A_i\}$ is now the same as in the previous section except that each chosen virtual element $a_{i,l}$ is
removed from $X'$ once it has been selected into a set. After the sets are constructed we identify the virtual elements
back into the original ones and define the corresponding set graph in the usual way.
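The virtual-element construction can be sketched as a stub-matching procedure. The code below is a minimal illustration under the assumption that sizes are given as explicit integer sequences with equal totals; shuffling the list of virtual elements and cutting it into consecutive blocks is used here as an equivalent way of drawing without replacement. The function name and arguments are hypothetical.

```python
import random

def molloy_reed_sets(set_sizes, element_sizes, seed=None):
    """Build sets by drawing virtual elements without replacement.

    set_sizes[i]      -- prescribed size of set A_i
    element_sizes[a]  -- prescribed size (number of memberships) of element a
    """
    if sum(set_sizes) != sum(element_sizes):
        raise ValueError("total set size must equal total element size")
    rng = random.Random(seed)
    # One virtual copy of element a for every unit of its prescribed size.
    virtual = [a for a, size in enumerate(element_sizes) for _ in range(size)]
    rng.shuffle(virtual)
    sets, start = [], 0
    for k in set_sizes:
        # Identifying virtual copies with their original element merges
        # repeated picks of the same element within one set.
        sets.append(set(virtual[start:start + k]))
        start += k
    return sets

sets = molloy_reed_sets([2, 2, 1], [1, 2, 2], seed=0)
```

Whenever two virtual copies of the same element fall into one set, the realized sizes drop below the prescribed ones, which is why the construction needs restrictions on the maximal size values.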

By construction the resulting size distribution on the dual graph will be given by $\psi$ as long as the probability of
choosing two virtual elements $a_{i,l}$ and $a_{i,m}$ (corresponding to the same element $a_i$) is sufficiently small. To ensure
this one has to impose restrictions on the maximal size values. It is not difficult to show that the correlation between
the size of $A$ and the size of an element $a$ is multiplicative. In case of a linear relation between the number of sets $N$
and the number of elements $M$ we have

$$\Pr\{a \in A \mid |A| = k \wedge |a| = l\} \sim \frac{\mathrm{const}}{N}\, k \cdot l. \tag{18}$$

To see this observe that 

„ r . | | . , j | i n „[ among the k choices to generate 4 1 , lriN 

Pr{a G A \\A = k A a = I} = l -Pr <^ ° . , . f , } (19) 

L 111 i i j ^ is no virtual a — element [ 

M* - I M* — l — l M* - k - I + 1 
_1 M* M* - 1 '"' M*-k + l ( ' 

with AI* being the number of virtual elements. The last formula has the same structure as the expression for the 
pairing probability in the previous section hence we get, for Ik <C M* and bounded first moments of the V'-distribution, 
the claimed multiplicative correlation. We note that there is also a variant of the Molloy-Reed construction which 
produces an additive size-size correlation such that Pr {a G A \ \ A\ — k A \a\ — 1} ~ const ^ _j_ ^ holds (see for 
details of the algorithm). 
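
As a quick numerical check of the multiplicative correlation, the product in eq. (20) can be evaluated exactly and compared with its leading-order value $k l / M^*$. The function names and test values below are illustrative assumptions, not part of the original analysis.

```python
def membership_probability(k, l, m_star):
    """Exact eq. (20): 1 - prod_{j=0}^{k-1} (M* - l - j) / (M* - j)."""
    p_none = 1.0
    for j in range(k):
        # Probability that the (j+1)-th draw also misses all l virtual a-elements.
        p_none *= (m_star - l - j) / (m_star - j)
    return 1.0 - p_none

def multiplicative_approximation(k, l, m_star):
    """Leading-order term k * l / M*, valid for l * k << M*."""
    return k * l / m_star

exact = membership_probability(5, 4, 100_000)
approx = multiplicative_approximation(5, 4, 100_000)
```

For $k = 1$ the two expressions coincide exactly; for larger $k$ the relative difference shrinks as $lk / M^*$ becomes small.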

We next present a simulation-based comparison of the multiplicative and additive Molloy-Reed models with the FP4 
network. The input size distributions for the Molloy-Reed simulations are the same as in FP4. For completeness, 
we also include the simulation results based on the simple random intersection graph model defined in the previous 
section. To make clear which size distribution is given in that case, we use the notation P-model (O-model) for the 
intersection graph with fixed P (O) size distribution and denote by PO-model the corresponding Molloy-Reed graphs, 
since both size distributions are fixed therein. Figs. 9 and 10 show the degree distribution for the O- and P-graphs. 
There is very good agreement over the whole range of degree values between the real FP4 network projections and 
typical samples of the multiplicative Molloy-Reed model. This is quite remarkable, since any considerable deviation of the 
real network from the near-independence of the Molloy-Reed model should be visible in the degree distributions. The fact that there is no 
deviation between the degree distributions indicates that the majority of project-organization alignments is essentially 
a random process. Furthermore, the additive model reproduces the FP4 P-graph degree distribution well only for 
large degree values, indicating that the correlation is indeed multiplicative. 

Two quantities measuring local correlations are the triangle-degree dependence and the distribution of edge 
multiplicity introduced earlier. Fig. 11 compares the triangle-degree correlation for the O-graph. Although the overall 
picture is similar (linear dependence up to medium degree), there is a clear tendency toward higher triangle numbers in 
FP4 for large degree values. Again, the multiplicative version matches the data better than the additive model. 
The edge multiplicity, again for the O-graphs, is shown in Fig. 12. The real graph has a considerably smaller value 
of the exponent and extends to almost twice as large a maximal multiplicity value. Nevertheless, both Molloy-Reed 
models show a sharp scale-free distribution for the multiplicity. This is quite surprising, since, naively, one would 
expect the probability for positive edge multiplicity to go to zero as N becomes large. In summary, one has a strong 
agreement between the real data and the multiplicative Molloy-Reed model (the comparison results for FP2 and FP3 
are almost identical to the situation with FP4 and have therefore not been depicted here). Only in the fine structure 
of clustering characteristics are some differences observed. 
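
The quantities compared here can be computed from any bipartite membership list. The following sketch assumes a hypothetical input format (a dict mapping each project to its organizations) and takes the edge multiplicity of an organization pair to be its number of shared projects; it is not the code used for the simulations above.

```python
from collections import Counter
from itertools import combinations

def project_with_multiplicity(memberships):
    """Map each unordered organization pair to its number of shared projects."""
    weights = Counter()
    for orgs in memberships.values():
        # Every pair of organizations in one project gains one unit of multiplicity.
        for u, v in combinations(sorted(set(orgs)), 2):
            weights[(u, v)] += 1
    return weights

def degree_and_multiplicity_histograms(weights):
    """Return (degree histogram, edge-multiplicity histogram) of the O-graph."""
    degree = Counter()
    for u, v in weights:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values()), Counter(weights.values())

# Toy data (assumed): three projects, four organizations.
toy = {"p1": ["A", "B", "C"], "p2": ["A", "B"], "p3": ["C", "D"]}
weights = project_with_multiplicity(toy)
degree_hist, multiplicity_hist = degree_and_multiplicity_histograms(weights)
```

Swapping the roles of projects and organizations in the input gives the corresponding histograms for the P-graph.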

Finally, we briefly outline why, under certain circumstances, almost independent models like the Molloy-Reed one 
can have a scale-free edge-multiplicity distribution. To keep the discussion as transparent as possible, we study the 
question in a pure bipartite Bernoulli model, which can be thought of as a kind of predecessor to the Cameo-model 
discussed below. 

To each vertex from the O- and P-partitions (with cardinalities $N$ and $M$), we assign a power-law distributed, 
positive integer parameter $\mu(P)$ and $\nu(O)$ with exponents $\alpha$ and $\beta$. That is, we partition the P- and O-vertices into 
sets $D_\mu := \{P \mid \mu(P) = \mu\}$ and $G_\nu := \{O \mid \nu(O) = \nu\}$ such that $|D_\mu| = C_P M \mu^{-\alpha}$ and $|G_\nu| = C_O N \nu^{-\beta}$, where $C_P$ and 
$C_O$ are normalization constants. We further assume $N = C_{OP} \cdot M$ and put 

$$\Pr\{P \sim O\} := \frac{c}{M}\, \mu(P)\, \nu(O) . \tag{21}$$

It is easy to see that the expected degree, conditioned on the $\mu$ or $\nu$ value, is proportional to $\mu$ or $\nu$, respectively, 
and therefore the (bipartite) degree distribution on each partition has the same exponent as $\mu$ or $\nu$. Note that the 
maximal $\mu$ and $\nu$ values are given by $\mu_{\max} \sim M^{1/\alpha}$ and $\nu_{\max} \sim N^{1/\beta}$. 

Since the edge multiplicity in the projection graph corresponds to the number of paths of length 2 in the bipartite 
graph, we define $E_k^{(P2)} := \mathbb{E}\, \#\{(P, P') : \text{there are exactly } k \text{ paths of length 2 between } P \text{ and } P'\}$ and 
$E^{(P2)} := \sum_k k\, E_k^{(P2)}$. For fixed $P$ and $P'$ with parameters $\mu$ and $\mu'$, the expected number of paths of length 2 between the two 
vertices is given by 

$$\sum_\nu \frac{c^2}{M^2}\, \mu \mu'\, \nu^2\, |G_\nu| \tag{22}$$



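and therefore the expected total number of 2-paths in the P-partition is 

$$E^{(P2)} = \sum_{\mu,\mu'} |D_\mu|\, |D_{\mu'}| \sum_\nu \frac{c^2}{M^2}\, \mu \mu'\, \nu^2\, |G_\nu| \tag{23}$$

$$= c^2\, C_O\, C_P^2\, N \sum_{\mu,\mu'} \sum_\nu (\mu \mu')^{1-\alpha}\, \nu^{2-\beta} . \tag{24}$$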



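On the other hand, for the probability of an edge between $P$ and $P'$ in the P-projection graph we have the estimate 

$$\Pr\{P \sim P'\} = 1 - \prod_\nu \left(1 - \frac{c^2}{M^2}\, \mu \mu'\, \nu^2\right)^{|G_\nu|} \tag{25}$$

and hence for the expected total number of edges $E$ 

$$E = \sum_{\mu,\mu'} |D_\mu|\, |D_{\mu'}| \left(1 - \prod_\nu \left(1 - \frac{c^2}{M^2}\, \mu \mu'\, \nu^2\right)^{|G_\nu|}\right) \tag{26}$$

$$= \sum_{\mu,\mu'} \frac{C_P^2 M^2}{(\mu \mu')^{\alpha}} \left(1 - \prod_\nu \left(1 - \frac{c^2}{M^2}\, \mu \mu'\, \nu^2\right)^{|G_\nu|}\right) . \tag{27}$$

Several cases are now possible. For $\beta > 3$ and $\alpha > 2$, it is easy to see that $\lim E^{(P2)} / E = 1$ and higher edge multiplicities 
have essentially zero probability. 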
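
For a single pair of P-vertices, the expected number of 2-paths and the edge probability in the projection can be compared numerically. The sketch below makes several assumptions that are not fixed by the text: all normalization constants are set to 1, the pairing probability is taken as $(c/M)\,\mu\,\nu$, the class sizes $|G_\nu|$ are treated as real numbers, and classes are kept only while their size is at least 1. The function names are hypothetical.

```python
import math

def class_sizes(n_vertices, exponent, norm=1.0):
    """Sizes norm * n * nu^(-exponent) for nu = 1, 2, ... while the size is >= 1."""
    sizes = {}
    nu = 1
    while norm * n_vertices * nu ** (-exponent) >= 1.0:
        sizes[nu] = norm * n_vertices * nu ** (-exponent)
        nu += 1
    return sizes

def expected_two_paths(mu, mu2, g_sizes, c, m):
    """Expected number of 2-paths between two P-vertices with parameters mu, mu2."""
    return sum((c / m) ** 2 * mu * mu2 * nu ** 2 * size
               for nu, size in g_sizes.items())

def edge_probability(mu, mu2, g_sizes, c, m):
    """Probability that at least one 2-path exists, assuming independent edges."""
    log_none = 0.0
    for nu, size in g_sizes.items():
        x = (c / m) ** 2 * mu * mu2 * nu ** 2
        log_none += size * math.log1p(-x)  # log of (1 - x)^size
    return -math.expm1(log_none)           # 1 - exp(log_none), numerically stable

g = class_sizes(10_000, 3.5)
s = expected_two_paths(1, 1, g, 1.0, 10_000)
p = edge_probability(1, 1, g, 1.0, 10_000)
```

With a steep exponent such as 3.5, the two values nearly coincide, which corresponds to the regime in which higher edge multiplicities are negligible.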

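The situation is different if either condition is violated, since in this case $E^{(P2)} - E$ diverges and can become of the 
same order as $E$. For instance, we obtain for $\beta < 3$, $\alpha < 2$ 

$$E^{(P2)} - E \sim \sum_{\mu,\mu'} \frac{C_P^2 M^2}{(\mu \mu')^{\alpha}} \sum_{k \ge 2} \frac{(-1)^k}{k!} \left[\sum_\nu \frac{c^2\, \mu \mu'\, \nu^2}{M^2}\, C_O\, C_{OP}\, M\, \nu^{-\beta}\right]^k \tag{28}$$

$$\sim \sum_{\mu,\mu'} \frac{\mathrm{const} \cdot M^2}{(\mu \mu')^{\alpha}} \sum_{k \ge 2} \frac{(-1)^k}{k!} \left[\mathrm{const} \cdot \mu \mu'\, M^{\frac{3}{\beta} - 2}\right]^k \tag{29}$$

$$\sim \sum_{k \ge 2} \mathrm{const} \cdot \frac{(-1)^k}{k!}\, M^{\frac{2}{\alpha} + k\left(\frac{2}{\alpha} + \frac{3}{\beta} - 2\right)} \tag{30}$$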

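From the last formula, we see that the expected edge multiplicity $E^{(P2)} / E - 1$ can become positive for proper choices of 
$\alpha$ and $\beta$. We show that $E / E^{(P2)} < 1$ under the above assumptions. Since 

$$E^{(P2)} = \sum_{\mu,\mu'} \sum_\nu c^2\, C_O\, C_P^2\, C_{OP}\, M\, (\mu \mu')^{1-\alpha}\, \nu^{2-\beta} \tag{31}$$

$$\sim \mathrm{const} \cdot M^{\frac{2}{\alpha}(2 - \alpha) + 1 + \frac{1}{\beta}(3 - \beta)} \tag{32}$$

$$= \mathrm{const} \cdot M^{\frac{4}{\alpha} + \frac{3}{\beta} - 2} \tag{33}$$

and 

$$E \sim \sum_{k \ge 1} \mathrm{const} \cdot \frac{(-1)^{k+1}}{k!}\, M^{\frac{2}{\alpha} + k\left(\frac{2}{\alpha} + \frac{3}{\beta} - 2\right)} , \tag{34}$$

one gets 

$$\frac{E}{E^{(P2)}} \sim 1 - \sum_{k \ge 2} \mathrm{const} \cdot \frac{(-1)^k}{k!}\, M^{(k-1)\left(\frac{2}{\alpha} + \frac{3}{\beta} - 2\right)} \tag{35}$$

$$\sim 1 - \mathrm{const} \cdot M^{\,\cdots} \left(M^{\,\cdots} + o(1)\right) \tag{36}$$

$$= 1 - \mathrm{const} + o(1) . \tag{37}$$

Since the involved constant is positive, we get the desired result. A more careful analysis, which will be part of a 
forthcoming paper, shows that one also obtains a power law for the edge multiplicity, as observed in the simulations. 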



C. Random intersection graphs and the "Cameo" principle 



In this section, we give a possible explanation for the appearance of power laws in the size distribution. In most 
models of complex networks with power-law-like degree distributions, one assumes a kind of preferential attachment 
rule, as in the Albert-Barabási model. This makes little sense in our framework. Instead, we propose a rule called 
the "Cameo Principle", first formulated in [2]. 

Before giving an interpretation and motivation, we briefly describe the formal setting. Assign to each project a 
positive $\psi$-distributed random variable (r.v.) $\mu$ and to each organization a positive $\varphi$-distributed r.v. $\omega$ (note that, 
in contrast to section IV B, $\varphi$ and $\psi$ are not the size distributions). We assume $\varphi$ and $\psi$ to be supported on $(1, \infty)$ 
and monotonically decaying as $\omega$ and $\mu$ tend to infinity. On the bipartite graph, an edge between an organization $O$ and 
a project $P$ is then formed with probability 

$$p_{OP} = \frac{c_0}{\psi^{\alpha}(P) \sum_{P'} \psi^{-\alpha}(P')} + \frac{c_1}{\varphi^{\beta}(O) \sum_{O'} \varphi^{-\beta}(O')} , \tag{38}$$

where $c_0$ and $c_1$ are positive constants, $\alpha, \beta \in (0, 1)$, and all edges are drawn independently of one another. We 
are interested in the properties of the corresponding random P- and O-graphs for typical realizations of the $\omega$ and $\mu$ 
variables. The word typical is here understood in the sense of the ergodic theorem; namely, we assume 
$\frac{1}{N} \sum_O \varphi^{-\beta}(O) \sim \int \varphi^{1-\beta}\, d\omega =: C_O^{-1}$ and $\frac{1}{M} \sum_P \psi^{-\alpha}(P) \sim \int \psi^{1-\alpha}\, d\mu =: C_P^{-1}$, where $N$ and $M$ are the cardinalities of the O- and 
P-partitions and $\alpha$ and $\beta$ are such that the integrals are bounded. The above formula then reduces to 

$$p_{OP} = \frac{c_0 \cdot C_P}{M\, \psi^{\alpha}(P)} + \frac{c_1 \cdot C_O}{N\, \varphi^{\beta}(O)} . \tag{39}$$

The expected conditional size of a vertex is then given by 

$$E(|P| \mid \psi(P)) = \frac{c_0\, C_P\, N}{M\, \psi^{\alpha}(P)} + c_1 \tag{40}$$

and 

$$E(|O| \mid \varphi(O)) = \frac{c_1\, C_O\, M}{N\, \varphi^{\beta}(O)} + c_0 . \tag{41}$$

The interpretation behind the special form of the edge probability in eq. (38) is the following. The $\mu$ and $\omega$ values 
describe a kind of attractivity property inherent to projects and organizations. Thinking in terms of a virtual project 
formation process, the organizations that finally belong to a project $P$ can either join the project actively, in 
which case the $\mu$ value of $P$ is important, or enter the project more passively at the request of 
organizations already involved, in which case the attractivity $\omega$ of the corresponding organization is important. 
The attractivity of an organization could, for instance, be related to its reputation, financial strength, or quality of 
earlier projects in which the organization was involved. Extrapolating from human behavior, it is not directly the $\omega$ 
or $\mu$ value which enters the pairing probability, but rather the relative frequency of the $\omega$ or $\mu$ values: the rarer a 
property, the more attractive it becomes. This is in essence the content of the "Cameo" principle. 

The parameters $\alpha$ and $\beta$ can be seen as a kind of affinity for following the above rule; for $\alpha, \beta \to 0$ the rule is switched 
off and we recover a classical Erdős-Rényi intersection graph. In general, the values of $\alpha$ and $\beta$ are themselves quenched 
random variables with their own, usually unknown, distribution. As shown in [?], only the maximal $\alpha$ and $\beta$ values 
matter for the resulting degree distribution of the graphs. We therefore restrict ourselves in the following to constant 
values. 
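
A minimal sketch of the edge probability in eq. (38) is given below. It takes as input the density values $\psi(P)$ and $\varphi(O)$ already evaluated at each vertex; the function name, the argument layout, and the toy numbers are assumptions made for illustration only.

```python
def cameo_edge_probabilities(psi_p, phi_o, alpha, beta, c0, c1):
    """Eq. (38): return p[o][p] for every organization o and project p.

    psi_p: density values psi(P), one per project (assumed given).
    phi_o: density values phi(O), one per organization (assumed given).
    """
    wp = [x ** (-alpha) for x in psi_p]   # rarer values receive larger weight
    wo = [y ** (-beta) for y in phi_o]
    sum_p, sum_o = sum(wp), sum(wo)
    return [[c0 * a / sum_p + c1 * b / sum_o for a in wp] for b in wo]

# Toy values (assumed): three projects, two organizations.
psi_p = [0.5, 0.25, 0.1]
phi_o = [0.4, 0.2]
p = cameo_edge_probabilities(psi_p, phi_o, alpha=0.5, beta=0.5, c0=0.3, c1=0.2)
```

Summing a row gives the expected size of an organization, and summing a column gives the expected size of a project. With `alpha = beta = 0` every entry reduces to the constant `c0 / M + c1 / N`, the Erdős-Rényi case mentioned above.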

Since the conditional expectations of the size values (eqs. (40) and (41)) are proportional to $\psi^{-\alpha}$ and $\varphi^{-\beta}$, we have to 
estimate their induced distribution. It can be shown [3] that $z := \varphi^{-\beta}(\omega)$ is asymptotically distributed with density 
$z^{-\left(1 + \frac{1}{\beta} + o(1)\right)}$ when $\varphi(\omega)$ decays monotonically and faster than any power law to zero as $\omega \to \infty$. When $\varphi(\omega)$ is itself 
a power-law distribution with exponent $\gamma$, the resulting distribution for $z$ will be $\sim z^{-\left(1 + \frac{1}{\beta} - \frac{1}{\beta\gamma} + o(1)\right)}$. Therefore, the 
induced distribution is always a power law and independent of the details of $\varphi$. Applying this result to our model, we 
immediately obtain a power-law distribution for the size distribution on the P- and O-graphs, with exponents depending 
essentially only on $\alpha$ and $\beta$. It is not difficult to see that, due to the edge independence in the model definition, the 
resulting degree distributions are again of power-law type. The Cameo Ansatz hence generates in a natural way a 
bipartite graph in which both projections exhibit two of the main features of the FP-networks. Furthermore, we obtain 
a linear dependence of the mean triangle number $\Delta_k$ on the degree, as in section IV A. 
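
The induced power law can be checked numerically for one concrete fast-decaying density. The sketch assumes $\varphi(\omega) = e^{-(\omega - 1)}$ on $(1, \infty)$, which is our choice and not taken from the text; in that case $z = \varphi^{-\beta}(\omega)$ is exactly Pareto distributed with density exponent $1 + 1/\beta$. The exponent is estimated with the standard maximum-likelihood (Hill) estimator.

```python
import math
import random

def sample_z(beta, n, seed=0):
    """Draw z = phi(omega)^(-beta) for the assumed phi(omega) = exp(-(omega - 1))."""
    rng = random.Random(seed)
    # omega - 1 is Exp(1) distributed, hence z = exp(beta * (omega - 1)).
    return [math.exp(beta * rng.expovariate(1.0)) for _ in range(n)]

def hill_density_exponent(samples, z_min=1.0):
    """Maximum-likelihood estimate of the density exponent of a Pareto tail."""
    logs = [math.log(z / z_min) for z in samples if z > z_min]
    return 1.0 + len(logs) / sum(logs)

beta = 0.5
estimate = hill_density_exponent(sample_z(beta, 20_000))
predicted = 1.0 + 1.0 / beta
```

The estimate should lie close to the predicted value of 3 for `beta = 0.5`, up to sampling noise.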

None of the models discussed in section IV can reproduce a scale-free distribution of the edge multiplicity with the 
same low exponent as observed in each of the FP networks. It will be interesting to see whether the inclusion of 
memory effects like the "My friends are your friends" principle [?] will change the picture. 



V. CONCLUSIONS 



In this work, we have described research collaboration networks determined from research projects funded by 
the European Union. The networks are large in terms of size, complexity, and economic impact. We observed 
numerous characteristics known from other complex networks, including scale-free degree distribution, small diameter, 
and high clustering. Using a random intersection-graph model, we were able to reproduce many properties of the 
actual networks. The empirical and theoretical investigations together shed light on the properties of these complex 
networks, in particular that the EU-funded R&D networks match well with typical realizations of random graph models 
characterized by just four parameters: the size, edge number, exponent of project-projection degree distribution, and 
exponent of organization-projection degree distribution. 

In terms of real-world interpretation, the present analysis yields three major insights. First, based on the fact that 
the size distribution of projects did not change significantly between the Framework Programs, any possible changes 
in project formation rules (which we do not know at this stage) did not affect the aggregate structure of the resulting 
research networks. Second, the fact that integration between collaborating organizations has increased over time, as 
measured by the average clustering coefficient, indicates that Europe has already been moving towards a more closely 
integrated European Research Area in the earlier Framework Programs. Finally, the fact that a sizeable number of 
organizations collaborate more than once in each Framework Program shows that there appears to be a kind of robust 
backbone structure in place, which may constitute the core of the European Research Area. 

In terms of application, the present results suggest a number of extensions. First, it is essential to learn more about 
the properties of the vertices in our networks. To what extent can they be characterized and classified? What kind of 
structural patterns emerge if we add this information? Second, we need to know more about the micro-structure of the 
networks. In which areas are the networks highly clustered and where does this clustering come from? What kind of 
subgroups can be identified? Third, we need to learn more about where the observed distribution of edge multiplicity 
comes from. Finally, it would be desirable to explicitly include edge weights in the analysis. Presumably, actors 
who collaborate more frequently are closer to each other than actors who collaborate only once. This may 
significantly impact the structural features we are able to observe, as well as the conclusions we might draw concerning 
the link between network structure and function. 

FIG. 1: Distribution of project sizes. 

FIG. 2: Distribution of organization sizes. 

FIG. 3: Degree distribution of projects projection. 

FIG. 4: Degree distribution of organizations projection. 

FIG. 6: Relation between degree and number of triangles in the organizations projection. 

FIG. 8: Distribution of edge multiplicities in the projects projection. 

FIG. 9: Degree distribution for the O-graphs. 

FIG. 10: Degree distribution for the P-graphs. 

FIG. 11: Triangle-degree correlation for the O-graphs. 

FIG. 12: Edge multiplicity for the O-graphs. 



Tables 



Framework Program  | budget^a | # P          | million EUR/P | #(P>1)^b     | # O   | million EUR/O 
FP1 (1984-1988)    | 3.8      | 3283         | 1.15          | 1696         | 2500  | 1.52 
FP2 (1987-1991)    | 5.4      | 3885         | 1.39          | 3013         | 6135  | 0.88 
FP3 (1990-1994)    | 6.65     | 5294         | 1.25          | 4611         | 9615  | 0.69 
FP4^c (1994-1998)  | 13.3     | 15061 (9087) | 0.88          | 11374 (8039) | 20873 | 0.64 

^a billion ECU/EUR 
^b projects with more than one participating organization 
^c R&D projects listed in parentheses. The number excludes all projects devoted to preparatory, demonstration, and training activities. 

TABLE I: FP1-4 total budget and number of funded projects. The smaller average funding per project and organization in FP4 is an 
artefact, as FP4 involves a large number of scholarships and the like, which are smaller than research projects (however, we cannot 
isolate the bias created). 



graph characteristic               | FP1    | FP2      | FP3      | FP4 
# vertices: N                      | 2500   | 6135     | 9615     | 20873 
(N for larg. comp.)                | (2038) | (5875)   | (8920)   | (20130) 
N outside larg. comp.              | 462    | 260      | 695      | 743 
# edges: M                         | 9557   | 64300    | 113693   | 199965 
(# edges M larg. comp.)            | (9410) | (64162)  | (113219) | (199182) 
mean degree: d                     | 7.65   | 20.96    | 23.65    | 19.16 
(d larg. comp.)                    | (9.23) | (21.84)  | (25.39)  | (19.79) 
maximal degree: d_max              | 140    | 386      | 648      | 649 
mean triangles per vertex: Δ       | 22.90  | 169.70   | 244.91   | 146.04 
(Δ larg. comp.)                    | (27.97)| (177.16) | (263.84) | (151.26) 
maximal triangle-number            | 966    | 5295     | 15128    | 10730 
cluster coefficient: C             | 0.57   | 0.72     | 0.72     | 0.79 
(C larg. comp.)                    | (0.67) | (0.74)   | (0.75)   | (0.81) 
number of components               | 369    | 183      | 455      | 467 
diameter of largest component      | 9      | 7        | 9        | 10 
mean path length: λ of larg. comp. | 3.70   | 3.27     | 3.32     | 3.59 
exponent of degree distribution    | -2.1   | -2.0     | -2.0     | -2.1 
variance of degree exponent        | 0.4    | 0.3      | 0.3      | 0.3 
exponent of org-size distr.        | -2.1   | -1.9     | -1.7     | -1.8 
variance of size exponent          | 0.5    | 0.3      | 0.5      | 0.3 
mean # projects per org: E(|O|)    | 2.40   | 4.87     | 5.6      | 6.24 
maximal size (max |O|)             | 130    | 82       | 138      | 172 

TABLE II: Basic network properties of FP1-4 organizations projection. 
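
The mean degree row of Table II can be cross-checked against the vertex and edge counts using the identity d = 2M/N for simple undirected graphs. The snippet below copies the values from the table; the variable and function names are ours.

```python
# (N, M, reported mean degree) per Framework Program, copied from Table II.
table_ii = {
    "FP1": (2500, 9557, 7.65),
    "FP2": (6135, 64300, 20.96),
    "FP3": (9615, 113693, 23.65),
    "FP4": (20873, 199965, 19.16),
}

def mean_degree(n_vertices, n_edges):
    """Mean degree of an undirected graph: every edge contributes to two vertices."""
    return 2 * n_edges / n_vertices

# Absolute difference between the recomputed and the reported mean degree.
deviations = {fp: abs(mean_degree(n, m) - d) for fp, (n, m, d) in table_ii.items()}
```

All four deviations stay below the rounding precision of the table.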



graph characteristic               | FP1          | FP2          | FP3          | FP4 
# vertices: N                      | 3283         | 3884         | 5528         | 9087 
(N for larg. comp.)                | (2764)       | (3662)       | (5027)       | (8566) 
N outside larg. comp.              | 519          | 222          | 501          | 521 
# edges: M                         | 51217        | 94527        | 202358       | 348542 
(# edges M larg. comp.)            | (50940)      | (94471)      | (202306)     | (348474) 
mean degree: d                     | 31.20        | 48.68        | 73.20        | 76.71 
(d larg. comp.)                    | (36.86)      | (51.60)      | (80.49)      | (81.36) 
maximal degree: d_max              | 282          | 387          | 917          | 771 
mean triangles per vertex: Δ       | 774.41       | 871.19       | 1970.30      | 2034.31 
(Δ larg. comp.)                    | (919.53)     | (923.98)     | (2167.05)    | (2158.03) 
maximal triangle-number            | 12903        | 11125        | 37247        | 41141 
cluster coefficient: C             | 0.67         | 0.54         | 0.44         | 0.47 
(C larg. comp.)                    | (0.75)       | (0.57)       | (0.48)       | (0.50) 
number of components               | 369          | 183          | 455          | 467 
diameter of largest component      | 9            | 7            | 10           | 9 
mean path length: λ of larg. comp. | 3.24         | 2.80         | 2.72         | 2.80 
exponent of degree distribution    | (-0.8, -3.4) | (-0.7, -3.3) | (-0.6, -3.7) | (-0.3, -2.2) 
variance of degree exponent        | (0.4, 3.6)   | (0.3, 1.7)   | (0.3, 1.4)   | (0.2, 0.6) 
exponent of proj-size distr.       | -3.59        | -2.9         | -3.2         | -4.1 
variance of size exponent          | 0.6          | 0.4          | 0.2          | 0.3 
mean # orgs per project: E(|P|)    | 3.15         | 3.08         | 3.22         | 2.71 
maximal size (max |P|)             | 20           | 44           | 73           | 54 

TABLE III: Basic network properties of FP1-4 projects projection. 



